Method, system and apparatus for generating structured document files

ABSTRACT

A method, system, apparatus, and graphical user interface (GUI) for generating structured document files from a document image are disclosed. Structured document files are generated by segmenting the document image into one or more zones containing respective text images, converting the respective text images to digital text, automatically identifying layout information for each of the one or more zones, labeling each of the one or more zones in accordance with a schema, and automatically associating mark-up language tags with the labeled zones to generate the structured document files responsive to the identified layout information and a model file.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is related to and claims the benefit of U.S. Provisional Application No. 60/404,581, filed Aug. 20, 2002, for “A System for Generating Structured Document.” In addition, this application is related to U.S. patent application Ser. No. 10/293,859, filed Nov. 13, 2002, for “Document Classification and Labeling Using Layout Graph Matching.”

FIELD OF THE INVENTION

The present invention relates to the field of structured languages and, more particularly, to the generation of structured language document files from document images.

BACKGROUND OF THE INVENTION

Structured languages such as extensible mark-up language (XML) enable the creation of structured document files that are easily searchable and are viewable across multiple platforms, e.g., on a desktop computer and on a cellular telephone. For example, a structured document file retrieved via a global information network (e.g., the Internet) can be viewed in full on a desktop computer and can be viewed as text only on a cellular telephone. It is often desirable to convert existing hard copy documents or images of documents to structured document files to facilitate searching and displaying these documents. Accordingly, methods, systems, and apparatus for converting documents to structured document files are useful.

Existing documents are typically converted to structured document files by scanning the documents and automatically converting the text within the scanned documents to digital text using optical character recognition (OCR) software. The scanned and converted documents are then formatted, either manually or using proprietary data structures, to add mark-up language tags. Often, several different software packages are employed to perform these steps. These methods for generating structured document files tend to be inflexible, time consuming, and/or difficult to use. In addition, the original formatting of the document, e.g., font sizes, emphasis, etc., is often lost, making the documents more difficult to read when they are displayed.

Accordingly, methods, systems, and apparatus for converting existing documents to structured document files are needed that are not subject to the above limitations. The present invention fulfills this need among others.

SUMMARY OF THE INVENTION

The present invention is a method, system, and apparatus for generating structured document files from document images. Structured document files are generated by segmenting the document image into one or more zones containing respective text images, converting the respective text images to digital text, automatically identifying layout information for each of the one or more zones, labeling each of the one or more zones in accordance with a schema, and automatically associating mark-up language tags with the labeled zones to generate the structured document files responsive to the identified layout information and a model file.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram that conceptually represents an exemplary system architecture for generating structured document files from document images in accordance with the present invention;

FIG. 2 is a flow chart of exemplary steps for generating structured document files from document images in accordance with the present invention;

FIG. 3 is an exemplary graphical user interface (GUI) for assisting a user in generating structured document files in accordance with the present invention; and

FIG. 4 is an exemplary document from which structured document files are generated in accordance with the present invention.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 is a conceptual representation of an exemplary system architecture 100 for generating structured document files from document images in accordance with the present invention. One or more blocks within the illustrated system architecture 100 can be performed by the same piece of hardware or module of software. It should be understood that embodiments of the present invention may be implemented in hardware, software, or a combination thereof. In such embodiments, the various components and steps described below would be implemented in hardware and/or software.

In the illustrated system architecture 100, an electronic image of a document (the “document image”) is applied to a document processor 102. In certain exemplary embodiments, the document image is generated by scanning a physical document using conventional scanning techniques. In certain other exemplary embodiments, the document image is supplied in an electronic format such as a Tagged Image File Format (TIFF) file, Joint Photographic Experts Group (JPEG) file, or other such file. In these embodiments, a format converter (not shown) may be used to convert the document image into a format compatible with the present invention. Suitable document images and format converters for use with the present invention will be readily apparent to those of skill in the related arts.

The document processor 102 processes the document image in preparation for labeling and generating the structured document file(s), which actions are described in greater detail below. The illustrated document processor 102 includes a segmenter 104, a text converter 106, and a zone and text editor 108. The segmenter 104 segments the document image into zones containing text or images. For example, the segmenter 104 may create a zone containing the title of a document, a zone containing a paragraph within the document, and a zone containing a figure. In addition, the segmenter 104 determines layout information for the zones, for example, the font size and the position of the zone on the document. A suitable segmenter for use with the present invention will be readily apparent to those of skill in the art of image processing. Additional information regarding segmenters can be found in commonly assigned U.S. Pat. Nos. 5,892,843 and 6,327,388 to Zhou et al. entitled “Title, Caption and Photo Extraction from Scanned Document Images” and “Identification of Logos from Document Images,” respectively.

In an exemplary embodiment, the segmenter 104 identifies which zones contain text images and which zones contain figures. In certain exemplary embodiments, each zone is displayed in a color that represents the type of information within that zone. For example, text images may be displayed in one color, e.g., red, and non-text images such as tables and figures may be displayed in another color, e.g., green. In certain other exemplary embodiments, the zones may be distinguished in other ways such as with a border having a different color or pattern.

The text converter 106 converts the text images of the zones to digital text, i.e., text which is searchable and editable. For example, the text converter may convert the letters within the text images to their ASCII equivalents. In an exemplary embodiment, the text converter is a conventional optical character recognition (OCR) software tool. Suitable text converters for use with the present invention will be readily apparent to those of skill in the art of image processing.

The zone and text editor 108 edits the zones and the digital text. In an exemplary embodiment, the zone and text editor 108 may add zones, delete zones, or change the size of individual zones responsive to user inputs. For example, a user may enlarge a zone containing a portion of a document title to include the entire title. In an exemplary embodiment, layout information associated with a zone is updated in accordance with the changes to the zones. The zone and text editor 108 also may change the digital text responsive to user inputs. For example, misspelled words may be corrected by a user. In an exemplary embodiment, the zone and text editor 108 receives user inputs via a graphical user interface, which is described in detail below. Suitable zone and text editors for use with the present invention will be readily apparent to those of skill in the art of image processing.

The document, as processed by the document processor 102, is applied to a model selector 110. The model selector 110 selects a previously developed model file, described below, having features that resemble features of the document. In an exemplary embodiment, the model selector selects the model file from a plurality of previously developed model files. Each of the model files references a schema, which describes the structure of a document that contains valid semantics (e.g., title, author, abstract, etc., for a document such as a technical paper), and includes physical characteristics for the elements of the schema and their spatial relationships relative to one another.

In an exemplary embodiment, the model file is selected by a user, e.g., via the graphical user interface (GUI) described below. In an alternative exemplary embodiment, the model selector 110 compares features of the processed document image to stored features of previously developed model files to automatically select a model file. In certain exemplary embodiments, a costing technique is employed with a cost assigned to each feature and lower costs representing a higher level of resemblance. In accordance with this embodiment, a comparison cost is determined for each available model file compared to the document image and the model file with the lowest cost is selected. A method for automatically selecting a model file by matching features is described in commonly assigned U.S. patent application Ser. No. 10/293,859, filed Nov. 13, 2002, for “Document Classification and Labeling Using Layout Graph Matching,” having at least one common inventor (referred to herein as the “Document Classification and Labeling Application”).
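The lowest-cost selection described above can be illustrated with a minimal sketch. The cost function here (a weighted sum of absolute feature differences) is an assumption chosen for illustration only; the actual matching technique is the one described in the Document Classification and Labeling Application. The names `ModelFile`, `comparison_cost`, and `select_model`, and the feature names used in the usage example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelFile:
    """Hypothetical stand-in for a previously developed model file."""
    name: str
    features: dict  # feature name -> stored numeric value


def comparison_cost(doc_features, model, weights):
    """Assumed cost: weighted sum of absolute feature differences.

    A lower cost represents a higher level of resemblance.
    """
    cost = 0.0
    for feature, weight in weights.items():
        doc_value = doc_features.get(feature, 0.0)
        model_value = model.features.get(feature, 0.0)
        cost += weight * abs(doc_value - model_value)
    return cost


def select_model(doc_features, models, weights):
    """Return the model file with the lowest comparison cost."""
    return min(models, key=lambda m: comparison_cost(doc_features, m, weights))


# Usage with made-up features: column count and vertical title position.
models = [
    ModelFile("twoColumn", {"columns": 2.0, "title_y": 220.0}),
    ModelFile("oneColumn", {"columns": 1.0, "title_y": 300.0}),
]
weights = {"columns": 100.0, "title_y": 1.0}
document = {"columns": 2.0, "title_y": 230.0}
best = select_model(document, models, weights)
```

In this sketch a user-selected model would simply bypass `select_model`, consistent with the manual selection embodiment.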

A schema editor 112 edits the schema. In an exemplary embodiment, the schema is retrieved based on a reference to the schema in the model file. In an alternative exemplary embodiment, the schema may be referenced by a user, e.g., via the GUI described below. The schema editor 112 may be used to add or remove elements from the schema responsive to user inputs. In an exemplary embodiment, the schema editor 112 is displayed in a tree-view and the user inputs are received via the GUI described below. A suitable schema editor will be readily apparent to those of skill in the related arts.

A model developer 114 develops the models for use by the model selector 110. In an exemplary embodiment, the model developer 114 develops the model by processing document samples. In certain exemplary embodiments, the model developer 114 develops the model responsive to user inputs. If the schema is changed by the schema editor 112, the model developer 114 needs to develop a new model in accordance with the new schema that accommodates the new relations. A suitable model developer for use with the present invention is described in the Document Classification and Labeling Application.

In an exemplary embodiment, models are developed at a system level. When developed at the system level, a user's edit and correction activities of logical labeling results are monitored. An automatic model learning process updates the document model through a feedback loop based on user modified results. In an alternative exemplary embodiment, models are developed at the user level. When developed at the user level, a GUI tool is provided to allow a more knowledgeable user to manually create a new model from a set of known samples.

The document, as processed by the document processor 102, is also applied to a labeler 116. The labeler 116 applies labels to the zones defined by the document processor 102 in accordance with the schema. For example, the labeler may label a zone containing the title of the document with the element “title.” In an exemplary embodiment, the labeler applies labels to the zones responsive to a document model selected by the model selector 110.

In an exemplary embodiment, the labeler 116 automatically labels the zones using a layout graph technique. An exemplary layout graph represents each schema element associated with a selected model file and its spatial relationships to one or more of the other schema elements and another exemplary layout graph represents each zone in a document image and its spatial relationship to one or more of the other zones. In the exemplary embodiment, a document image is compared to a selected model by the layout graphs using a known global scale over total cost matching technique. Because some elements in a document may correspond to multiple zones, multiple zones may match the same element. A suitable layout graph technique for use with the present invention, from which one skilled in the art can develop a suitable labeler 116, is described in the Document Classification and Labeling Application.

A label editor 118 enables manual editing of the labeled zones. In an exemplary embodiment, the label editor 118 updates the labels on zones applied automatically by the labeler 116 responsive to user inputs. For example, if the labeler 116 labeled a zone containing the title of the document with the element “author,” the label editor can be used to change the label of that zone to the correct element, i.e., “title.” In an alternative exemplary embodiment, the label editor 118 labels each of the zones manually responsive to user inputs. In an exemplary embodiment, the label editor 118 receives user inputs via the GUI described below. A suitable label editor 118 for use with the present invention will be readily apparent to those of skill in the art of image processing.

A structured document generator 120 generates structured document files responsive to layout information associated with the zones, labeling results, and the selected model file. In an exemplary embodiment, the structured document generator 120 generates an extensible mark-up language (XML) file and an extensible style-sheet language (XSL) file for each document image that it processes. The XML file represents the document structure and the XSL file represents the document layout. In an exemplary embodiment, the XSL file may represent layout information such as font type and size, font color, and zone coordinates.

To develop the XML file, the exemplary structured document generator 120 receives layout information from the document processor 102 and labeling results from the labeler 116. In an exemplary embodiment, the layout information contains the number of zones within the document, identification numbers for each zone, and the location of each zone. In addition, the structured document generator 120 receives digital text for each zone containing a text image from the document processor 102. In certain exemplary embodiments, the document processor 102 develops a layout file that includes the layout information and the digital text. In these embodiments, the document processor 102 passes the layout file to the structured document generator 120 for processing. In certain other exemplary embodiments, the digital text is included within the labeling results.

The exemplary structured document generator 120 uses the labeling results to match each zone to the appropriate schema elements. The structured document generator 120 then combines the layout file and the labeling results in a manner that will be readily apparent to those skilled in the art of computer programming to generate the XML file. A portion of an exemplary XML file is depicted in Table 6 below.

In certain exemplary embodiments for generating the XML file, the structured document generator 120 also receives the model file, which contains the schema, from the model selector 110. The structured document generator 120 may then validate the labeling results by comparing the labeling results to the schema to verify that each label of the labeling results corresponds to a schema element. In addition, the structured document generator 120 may use the model file to incorporate a complete document tree structure into the XML file. For example, the element “name” may contain two sub-elements, e.g., first name and last name. In this embodiment, the structure for the sub-elements may be included in the XML file. The incorporation of the document tree structure into the XML file will be readily apparent to those of skill in the art of computer programming. Also, the structured document generator 120 may use the model file to match individual elements to corresponding layout information in the layout file, e.g., using zone coordinates contained in the layout file and in the model file.
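The label validation step described above might look like the following minimal sketch, which collects element names from a schema and reports any label that does not correspond to one. The function names `schema_element_names` and `invalid_labels`, the in-memory form of the labeling results (a mapping from zone identification number to label), and the abbreviated schema string are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def schema_element_names(xsd_text):
    """Collect the name attribute of every element declaration in a schema."""
    root = ET.fromstring(xsd_text)
    names = set()
    for node in root.iter():
        # Namespaced tags look like "{namespace-uri}element".
        if node.tag.endswith("}element") and "name" in node.attrib:
            names.add(node.attrib["name"])
    return names


def invalid_labels(labeling_results, element_names):
    """Return the zones whose label is not a schema element.

    labeling_results is assumed to map zone id -> label.
    """
    return {
        zone_id: label
        for zone_id, label in labeling_results.items()
        if label not in element_names
    }


# Usage with a shortened, hypothetical two-element schema.
xsd = (
    '<xsd:schema xmlns:xsd="http://www.w3.org/1999/XMLSchema">'
    '<xsd:element name="document"><xsd:complexType>'
    '<xsd:element name="title" type="xsd:string"/>'
    '<xsd:element name="author" type="xsd:string"/>'
    '</xsd:complexType></xsd:element></xsd:schema>'
)
names = schema_element_names(xsd)
problems = invalid_labels({0: "title", 1: "byline"}, names)
```

An empty result from `invalid_labels` would indicate that every label corresponds to a schema element.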

To develop the XSL file, the exemplary structured document generator 120 receives the layout information from the document processor 102, the labeling results from the labeler 116, and the model file from the model selector 110. Pseudo code to direct element processing to generate the XSL file is depicted in Table 1.

TABLE 1
Start root of tree
Repeat nodes
    If leaf node; no child node
        if this node matches multiple zones
            output xsl template using <xsl:for-each>
        else
            output xsl template using <xsl:template match>
        get next node
    Else; has child node
        get child node
    Endif

The pseudo code depicted in Table 1 illustrates the processing of the elements by the structured document generator 120. In a tree view representation of the schema, each element of the schema is represented as a node. Each node can have one or more child nodes. For example, a logical element “author” can have two child nodes, e.g., “last name” and “first name”, and it can have multiple instances to reflect multiple authors. A node can also be a leaf node, which indicates there are no branches from this node, such as “first name” or “last name.” Processing continues until all elements/nodes are processed.
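The traversal of Table 1 can be sketched in Python as a recursive walk over the schema tree. This is an illustrative sketch under stated assumptions, not the generator's actual implementation: the `Node` class, the `zone_counts` mapping (element name to number of matched zones), and the body of each emitted fragment are hypothetical. The emitted strings are fragments only; in a complete style sheet an `<xsl:for-each>` would sit inside an enclosing template.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical schema tree node; a node with no children is a leaf."""
    name: str
    children: list = field(default_factory=list)


def emit_templates(node, zone_counts, out=None):
    """Walk the schema tree and emit one XSL fragment per leaf node."""
    if out is None:
        out = []
    if not node.children:
        # Leaf node: choose the construct based on how many zones matched.
        if zone_counts.get(node.name, 0) > 1:
            out.append(
                f'<xsl:for-each select="{node.name}">'
                '<xsl:value-of select="."/></xsl:for-each>'
            )
        else:
            out.append(
                f'<xsl:template match="{node.name}">'
                '<xsl:value-of select="."/></xsl:template>'
            )
    else:
        # Node has children: descend into each child in turn.
        for child in node.children:
            emit_templates(child, zone_counts, out)
    return out


# Usage: "leftColumnText" is assumed to have matched two zones.
tree = Node("document", [
    Node("title"),
    Node("author", [Node("firstName"), Node("lastName")]),
    Node("leftColumnText"),
])
fragments = emit_templates(tree, {"leftColumnText": 2})
```

Recursion replaces the explicit "get next node" and "get child node" steps of the pseudo code; processing ends when every leaf has been visited.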

For each element processed by the structured document generator 120, the structured document generator 120 matches the element to corresponding layout information in the layout file, e.g., using zone coordinates contained in the layout file and in the model file. The structured document generator 120 then combines the element with the corresponding layout information to generate the XSL file in a manner that will be readily apparent to those of skill in the art of computer programming.

In certain exemplary embodiments, a layer concept associated with the hyper text mark-up language (HTML) preserves the original layout, e.g., using <DIV></DIV> tags in the XSL file. Each layer enclosed within the <DIV></DIV> tags is independent of every other layer. Thus, a zone in one layer has no effect on the position of a zone in another layer when the zones are displayed on a known web browser (not shown). Accordingly, a zone may be assigned coordinates with respect to a common origin for display on a web browser without affecting the positioning of any other zone. In addition, each zone can have its own style, e.g., font size, type, and color. In an exemplary embodiment, each zone is assigned to a different layer. The original coordinates for each zone are then used to develop display coordinates in a known manner to display the zone on a web browser. Since the original coordinates for the zones are used to position the zones, the zones are referenced to a common origin, and the zones do not affect the position of zones in other layers, the position of the zones when displayed on a web browser will at least partially match the original layout of the original document image when all layers are displayed. Style information such as font size may also be included to increase the resemblance between the displayed document and the original document image. A portion of an exemplary XSL file is depicted in Table 7 below.
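One way to picture the layer concept is a helper that turns a zone's original coordinates into an independently positioned <DIV> layer. The sketch below is an assumption-laden illustration: the function name `zone_to_div`, the interpretation of a zone box as (left, top, right, bottom), the use of CSS absolute positioning, and the `scale` factor for deriving display coordinates are not taken from the exemplary XSL file.

```python
from html import escape


def zone_to_div(box, text, scale=1.0, font_size_pt=None):
    """Build one absolutely positioned DIV layer for a zone.

    box is assumed to be (left, top, right, bottom) in image pixels,
    measured from a common origin at the top-left of the page.
    """
    left, top, right, bottom = box
    style = (
        "position:absolute; "
        f"left:{round(left * scale)}px; "
        f"top:{round(top * scale)}px; "
        f"width:{round((right - left) * scale)}px; "
        f"height:{round((bottom - top) * scale)}px;"
    )
    if font_size_pt is not None:
        # Optional per-zone style information.
        style += f" font-size:{font_size_pt}pt;"
    return f'<DIV style="{style}">{escape(text)}</DIV>'


# Usage: a zone scaled to half size for display.
layer = zone_to_div((100, 200, 300, 260), "A & B", scale=0.5)
```

Because each layer carries its own coordinates, changing one zone's box leaves every other layer's position untouched, which is the property the paragraph above relies on.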

In certain exemplary embodiments, one or more of the zones may contain non-text images (not shown) that are not converted to digital text such as graphs, pictures, etc. In an exemplary embodiment, for each zone containing a non-text image the structured document generator 120 generates an image file from the portion of the original image within the zone. The structured document generator 120 then inserts a link to the image file in the XML file in a manner similar to the insertion of digital text described above to generate the XML file. In addition, the structured document generator 120 generates the XSL file in a similar manner as described above for text images with the exception that style information such as font size is not included.

FIG. 2 depicts a flow chart 200 of exemplary steps for generating structured document files in accordance with the present invention. Processing begins at block 202 with the segmentation of the document image into zones at block 204. At block 206, text images within the zones are converted to digital text. At block 208, the zones and digital text are edited. In an exemplary embodiment, the zones are segmented, digital text is converted, and zones and digital text are edited as described above with reference to the segmenter 104, text converter 106, and editor 108, respectively, of FIG. 1.

At block 210, layout information for the document image is identified. The layout information includes non-content related features that define the look of the document. These features may include, by way of non-limiting example, font size, emphasis formatting, positional information, etc. In an exemplary embodiment the layout information is used in the generation of the structured document files such that a displayed image of the structured document files retains at least a portion of the original layout information associated with the document image. Because the original layout information is maintained, the displayed images reflect the formatting of the original documents, thus making them easier to read. In an exemplary embodiment, the layout information is identified by the above-described segmenter 104 (FIG. 1).

At block 212, the zones are labeled in accordance with a schema and, at block 214, mark-up language tags are associated with the labeled zones to create the structured document files. In an exemplary embodiment, the zones are labeled and the tags are associated as described above with reference to the labeler 116 and the structured document generator 120, respectively, of FIG. 1.

FIG. 3 depicts an exemplary graphical user interface (GUI) 300 for use in the present invention. The illustrated GUI 300 includes a tool bar 302, a schema panel 304, and a viewing panel 306. The GUI 300 provides an easy-to-use interface that allows a user to generate structured document files from document images.

In an exemplary embodiment, a user accesses a workflow menu (not shown) by selecting a “workflow” indicator 308 from the tool bar 302. In certain exemplary embodiments, the workflow menu guides the user sequentially through the structured document file generation process described above, e.g., segmenting the document image into zones, converting text to digital text, labeling the zones, and generating the structured document files. In certain other exemplary embodiments, the user is guided through the workflow process by a “workflow” icon 310, which is described in detail below. In certain exemplary embodiments, arrow indicators 311 are available to move back and forth sequentially through the workflow process. Alternatively, the entire workflow process of generating a structured document from a document image is performed automatically by selecting an “auto execute” icon 312 in the toolbar 302.

The “workflow” icon 310 displays unique images that correspond to different steps of the workflow process. In an exemplary embodiment, the “workflow” icon 310 reflects a next step in the workflow process to guide a user sequentially through the process of generating structured document files from document images. For example, prior to loading a document image, the “workflow” icon 310 may display the text “Load Image,” and after the document image is loaded, but before the document image is segmented, the “workflow” icon 310 may display the text “Segment.” Selecting the “workflow” icon 310 when the text “Load Image” is displayed results in the loading of an image and selecting the “workflow” icon 310 when the text “Segment” is displayed results in the segmentation of the document image.

In an exemplary embodiment, selecting an “images” icon 314 on the toolbar 302 initiates a document image source selection, e.g., via a conventional file open window (not shown). A selected document image is then displayed in the viewing panel 306.

In an exemplary embodiment, the selection of a document image initiates a model file matching routine that identifies a model file for the document image. From the identified model file, a schema is identified for display in the schema panel 304, e.g., in a tree view. Alternatively, a user selects the schema manually by selecting a “schema” icon 316 on the toolbar 302. In certain exemplary embodiments, the user changes the automatically or manually selected schema by selecting the “schema” icon 316. In certain exemplary embodiments, the schema may be updated, e.g., elements may be added or removed from the schema, or a new schema may be created using conventional editing techniques. Once editing is complete, the user saves the newly edited (or created) schema file. The model matching process, described above, is performed after a new schema is saved to select a model corresponding to the new schema.

Document segmentation, text conversion, and labeling are performed in the viewing panel 306. In an exemplary embodiment, the document is segmented and text is converted responsive to the loading of a document image. In alternative exemplary embodiments, the document is segmented and the text is converted by selecting the “workflow” icon 310 on the toolbar 302 twice (once to initiate segmentation and once to initiate text conversion) or through the workflow menu (not shown) that appears when the workflow indicator 308 is selected. In accordance with these embodiments, the document is segmented into “meaningful” zones according to physical attributes such as font size, spacing, etc. In the illustrated embodiment, segmented zones are displayed with bounding boxes overlaid on the original image, which can be corrected by the user using conventional techniques. There are several features that are available within the viewing panel 306. These features include zoom in/out, zone selection/editing, and zone change features. In the illustrated embodiment, text conversion results for identified text regions are also overlaid directly in each zone for easy review and editing using conventional techniques. It will be readily apparent to those of skill in the art that segmentation and text conversion may be performed concurrently or in two distinct steps.

After segmentation and text conversion, labels are added to the segments. In an exemplary embodiment, labeling is initiated through its selection from the workflow menu or by selecting the “workflow” icon 310. In the illustrated embodiment, the labeling results in the display of logical labels on the top left corner of each zone as shown in FIG. 3. In an exemplary embodiment, the logical labels can be edited in a conventional manner, e.g., by “right-clicking” to display a pull-down menu (not shown) to link and unlink the zone to a schema element or by dragging the schema elements from the schema tree to a zone. In certain exemplary embodiments, the labels associated with the zones may be saved by selecting a “SaveLink” icon 318 on the tool bar 302.

After labeling, the structured documents are generated by selecting a structured document generation indicator in the workflow menu, selecting the “workflow” icon 310, or selecting a “Save XML” icon 320 on the toolbar 302. In an exemplary embodiment, this prompts the creation of two structured document files: an XML file and a corresponding XSL file.

The GUI 300 additionally provides an easy-to-use interface that allows a user to train model files. In an exemplary embodiment, a training mode is entered by selecting this mode from the “workflow” menu or by selecting a “LearnModel” icon 322 on the toolbar 302. In the training mode, a user edits one or more similar sample documents. During editing, the user's edits are monitored and analyzed to develop a model file from the sample documents. The new model file can then be used to segment and label subsequent documents.

FIG. 4 depicts a document image 400 to be processed in accordance with the present invention. Initially, the document image 400 is scanned using conventional scanning software. The illustrated document image 400 includes several blocks of text including a title 402 and author information 404, e.g., name, telephone number, etc. A schema for a two-column text document similar in style to the document image 400 is included in Table 2.

TABLE 2
<xsd:schema xmlns:xsd="http://www.w3.org/1999/XMLSchema">
  <xsd:element name="document">
    <xsd:complexType>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="leftColumnText" type="xsd:string"/>
      <xsd:element name="abstract" type="xsd:string"/>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="leftHeader" type="xsd:string"/>
      <xsd:element name="page" type="xsd:string"/>
      <xsd:element name="footer" type="xsd:string"/>
      <xsd:element name="undefined" type="xsd:string"/>
      <xsd:element name="copyright" type="xsd:string"/>
      <xsd:element name="rightHeader" type="xsd:string"/>
      <xsd:element name="rightColumnText" type="xsd:string"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

The schema includes “elements” that correspond to the blocks of text within the document 400. For example, the element “title” corresponds to the title 402 and the element “author” corresponds to the author information 404.

A portion of the model file associated with the schema of Table 2 is illustrated in Table 3. In an exemplary embodiment, the model file, which references the schema file, i.e., twoColumn.xsd, is trained from a collection of documents. The model file contains the physical characteristics of each element within the schema, their spatial relationships, and the relative weights of the characteristics and spatial relationships.

TABLE 3

<?xml version="1.0"?>
<!-- Created by jzegmdlWriteXml at 12:22:49 on Friday, 19 April 2002 -->
<jzegGRAPH class="document" numnode="10" th="20" nprob="0">
  <schemainfo xsi:schemaLocation="twoColumn.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" />
  <jzegNODE id="0" pos="238 862 1249 1477 743 1169 1012 615" wt="2 1 3 0 3 1 5 0" wnull="80000" label="abstract"/>
  <jzegNODE id="1" pos="673 409 1920 766 1296 587 1247 357" wt="0 2 0 1 2 1 0 1" wnull="80000" label="author"/>
  <jzegNODE id="2" pos="198 2667 1222 2971 709 2819 1024 304" wt="3 1 3 1 3 1 10 14" wnull="80000" label="copyright"/>
  <jzegNODE id="3" pos="2021 2961 2361 2992 2191 2976 340 31" wt="0 0 0 0 0 0 0 0" wnull="588" label="footer"/>
  <jzegNODE id="4" pos="236 1529 1252 2627 744 2078 1016 1099" wt="3 0 3 1 3 1 5 0" wnull="80000" label="leftColumnText"/>
  <jzegNODE id="5" pos="160 70 1460 127 810 99 1299 57" wt="2 11 3 9 3 10 13 21" wnull="80000" label="leftHeader"/>
  <jzegNODE id="6" pos="2382 3097 2430 3136 2406 3116 49 39" wt="3 6 3 7 3 7 9 10" wnull="80000" label="page"/>
  <jzegNODE id="7" pos="1325 857 2348 2953 1836 1905 1022 2096" wt="2 1 3 1 3 1 4 1" wnull="80000" label="rightColumnText"/>
  <jzegNODE id="8" pos="1932 64 2354 122 2143 93 422 58" wt="1 8 2 9 2 9 1 10" wnull="80000" label="rightHeader"/>
  <jzegNODE id="9" pos="509 220 2073 350 1291 285 1564 130" wt="1 8 1 2 3 3 0 2" wnull="80000" label="title"/>
  <jzegEDGE id1="0" id2="0" ov="-1" rel="2 2 2 2 2 1 3 1 3" wt="100 100 100 100 100 100 100 100 100"/>
  <jzegEDGE id1="0" id2="1" ov="-1" rel="1 3 1 3 1 1 3 3 3" wt="100 100 100 100 100 100 100 100 100"/>
  ...
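
For illustration only, the following Python sketch reads one node of the model file of Table 3 and pairs each physical characteristic with its weight. The source does not name the eight values of the pos attribute; the field names used here (left, top, right, bottom, center, width, height) are an assumption inferred from the arithmetic of the sample values, as is the positional pairing of the wt values with the pos values.

```python
import xml.etree.ElementTree as ET

# One node copied from Table 3 (straight quotes, whitespace normalized).
NODE_XML = (
    '<jzegNODE id="9" pos="509 220 2073 350 1291 285 1564 130" '
    'wt="1 8 1 2 3 3 0 2" wnull="80000" label="title"/>'
)

# Hypothetical field names: the source does not document the pos layout.
POS_FIELDS = ("left", "top", "right", "bottom",
              "center_x", "center_y", "width", "height")


def parse_node(xml_text):
    """Return (label, features); features maps field name -> (value, weight)."""
    node = ET.fromstring(xml_text)
    pos = [int(v) for v in node.get("pos").split()]
    wt = [int(v) for v in node.get("wt").split()]
    if len(pos) != len(POS_FIELDS) or len(wt) != len(POS_FIELDS):
        raise ValueError("expected eight pos values and eight wt values")
    # Assumption: the n-th weight applies to the n-th pos value.
    return node.get("label"), dict(zip(POS_FIELDS, zip(pos, wt)))


label, features = parse_node(NODE_XML)
print(label, features["width"])
```

Under this reading, the width of the title node equals right minus left (2073 - 509 = 1564), which is what suggested the field order.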

A portion of an XML layout file resulting from the segmentation of the document image 400 and the conversion of text images to digital text is included in Table 4. In the illustrated embodiment, the results are stored by text lines and segmented into zones. This file contains the coordinates of each zone and the coordinates and contents of each line within each zone. (Note: in this example, the font size information is disabled.)

TABLE 4

<PAGE xmlns="http://www.research.panasonic.com/PINTL/physical" xsi:schemaLocation="twoColumn.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <ZONE id="0" box="183 74 1473 127" zone-type="TEXT" font-size="0">
    <LINE id="0" box="183 74 1473 127" font-size="0"><![CDATA[CHI'95 MOSAIC OF CREATIVITY - May 7-11 1995 ]]></LINE>
  </ZONE>
  <ZONE id="1" box="562 225 2041 466" zone-type="TEXT" font-size="0">
    <LINE id="0" box="610 225 2012 297" font-size="0"><![CDATA[High-End High School Communication: ]]></LINE>
    <LINE id="1" box="583 315 2041 389" font-size="0"><![CDATA[Strategies and Practices of Students in a ]]></LINE>
    <LINE id="2" box="562 407 1739 466" font-size="0"><![CDATA[Networked EnvIronment ]]></LINE>
  </ZONE>
  <ZONE id="2" box="1764 62 2368 133" zone-type="TEXT" font-size="0">
    <LINE id="0" box="1803 77 2362 116" font-size="0"><![CDATA[Doctoral Consortium ]]></LINE>
  </ZONE>
  <ZONE id="3" box="920 488 1690 833" zone-type="TEXT" font-size="0">
    <LINE id="0" box="1129 499 1482 546" font-size="0"><![CDATA[Barry J. Fishman ]]></LINE>
    <LINE id="2" box="922 559 1691 609" font-size="0"><![CDATA[School of Education and Social Policy ]]></LINE>
    <LINE id="3" box="1058 620 1553 668" font-size="0"><![CDATA[Northwestern University ]]></LINE>
    <LINE id="4" box="1100 681 1508 723" font-size="0"><![CDATA[Evanston, IL 60208 ]]></LINE>
    <LINE id="5" box="1150 740 1460 785" font-size="0"><![CDATA[(708 ) 467 - 2405 ]]></LINE>
    <LINE id="6" box="1044 800 1565 833" font-size="0"><![CDATA[bfishman@covis.nwu.edu ]]></LINE>
  </ZONE>
  ...
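
By way of a non-limiting example, a layout file of the kind shown in Table 4 could be read into per-zone records as sketched below in Python. The function name read_zones and the record layout are hypothetical, and the reading of the box attribute as left, top, right, bottom is an assumption; the source states only that the file holds the coordinates of each zone and line.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the PAGE element in Table 4.
NS = {"p": "http://www.research.panasonic.com/PINTL/physical"}

# Zone 1 of Table 4 (straight quotes substituted so that it parses).
LAYOUT_XML = """<PAGE xmlns="http://www.research.panasonic.com/PINTL/physical"
      xsi:schemaLocation="twoColumn.xsd"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <ZONE id="1" box="562 225 2041 466" zone-type="TEXT" font-size="0">
    <LINE id="0" box="610 225 2012 297" font-size="0"><![CDATA[High-End High School Communication: ]]></LINE>
    <LINE id="1" box="583 315 2041 389" font-size="0"><![CDATA[Strategies and Practices of Students in a ]]></LINE>
    <LINE id="2" box="562 407 1739 466" font-size="0"><![CDATA[Networked EnvIronment ]]></LINE>
  </ZONE>
</PAGE>"""


def read_zones(xml_text):
    """Map zone id -> {"box": four integers, "text": line contents joined}."""
    page = ET.fromstring(xml_text)
    zones = {}
    for zone in page.findall("p:ZONE", NS):
        # Assumption: box holds left, top, right, bottom in pixels.
        box = tuple(int(v) for v in zone.get("box").split())
        lines = [(line.text or "").strip() for line in zone.findall("p:LINE", NS)]
        zones[zone.get("id")] = {"box": box, "text": " ".join(lines)}
    return zones


zones = read_zones(LAYOUT_XML)
print(zones["1"]["box"], zones["1"]["text"])
```

The text is kept exactly as recognized, including any recognition errors, so that later editing steps operate on the same content the layout file records.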

An XML label file resulting from the labeling of the zones is included in Table 5. The XML label file references the schema and the layout file. The XML label file contains the logical association between elements in the schema (by element name) and zones within a document layout (by zone number, defined in the layout file).

TABLE 5

<?xml version="1.0"?>
<!-- Created by jzeglogWriteXml at 16:43:10 on Monday, 07 July 2003 -->
<document layout="C:\XMLConverter\newSeg\test\chi95\chi95o001_layout.xml" xsi:schemaLocation="twoColumn.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <leftHeader idref="0"/>
  <title idref="1"/>
  <rightHeader idref="2"/>
  <author idref="3"/>
  <abstract idref="4"/>
  <leftColumnText idref="5"/>
  <leftColumnText idref="6"/>
  <leftColumnText idref="7"/>
  <copyright idref="8"/>
  <rightColumnText idref="9"/>
  <rightColumnText idref="10"/>
  <rightColumnText idref="11"/>
  <rightColumnText idref="12"/>
  <page idref="13"/>
</document>
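
As a minimal sketch, the association recorded in a label file such as that of Table 5 could be recovered in Python as follows. The function name zones_by_label is hypothetical and is not part of the described system; the input is the content of Table 5 with straight quotes substituted.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Content of Table 5; a raw string keeps the backslashes of the layout path.
LABEL_XML = r"""<document layout="C:\XMLConverter\newSeg\test\chi95\chi95o001_layout.xml"
          xsi:schemaLocation="twoColumn.xsd"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <leftHeader idref="0"/>
  <title idref="1"/>
  <rightHeader idref="2"/>
  <author idref="3"/>
  <abstract idref="4"/>
  <leftColumnText idref="5"/>
  <leftColumnText idref="6"/>
  <leftColumnText idref="7"/>
  <copyright idref="8"/>
  <rightColumnText idref="9"/>
  <rightColumnText idref="10"/>
  <rightColumnText idref="11"/>
  <rightColumnText idref="12"/>
  <page idref="13"/>
</document>"""


def zones_by_label(xml_text):
    """Return (layout file path, {schema element name: [zone ids]})."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for element in root:
        # Each child names a schema element and points at one zone by number.
        grouped[element.tag].append(int(element.get("idref")))
    return root.get("layout"), dict(grouped)


layout_path, labels = zones_by_label(LABEL_XML)
print(labels["leftColumnText"], labels["rightColumnText"])
```

Grouping by element name makes explicit that a single schema element may be associated with several zones of the layout file.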

A portion of a structured document XML file is included in Table 6. The structured document XML file contains only document contents, separated by logical element. As can be seen, one logical element in the schema (e.g., leftColumnText) can have multiple instances (zones), each identified by its idref attribute (zone ID).

TABLE 6

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="chi95o001.xsl"?>
<document xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance" xsi:noNamespaceSchemaLocation="twoColumn.xsd">
  <title idref="1"><![CDATA[High-End High School Communication: Strategies and Practices of Students In a Networked EnvIronment]]></title>
  <leftColumnText idref="5"><![CDATA[KEYWORDS: Media Spaces, Education, Communication, Design]]></leftColumnText>
  <leftColumnText idref="6"><![CDATA[INTRODUCTION Classroom are like islands, isolated home each other and the world beyond their boundaries. Students enter an enclosed Space and for the next forty to ninety minutes, all interaction is confined to the individuals contained within the classroom walls. More often than not, the instructions strategies employed in classrooms also isolate students from one another. Communication is comprised of back-and-brth exchanges between teacher and student, and only rarely from student to Student. This dissertation studies the deployment of highly interactive computer-based communication tools designed to break the boundaries that exist in classrooms, with the goal of elaborating principles for the effective design and implementation of these environments in school settings. ]]></leftColumnText>
  <leftColumnText idref="7"><![CDATA[The high school classrooms involved in this study have been augmented with a suite of highly interactive communication tools, including electronic mail, Usenet newsgroups, asynchronous multimedia notebooks, remote screen-sharing, and desktop video teleconferencing. In the CHI community, this combination of tools has come to be known as a media space [3,1]. Media spaces enable individuals or groups to ]]></leftColumnText>
  <abstract idref="4"><![CDATA[ABSTRACT This paper describes a study of the design of computer-based communication and media space environments that support highly interactive school-based learning communities. The two basic questions posed in this research are: (1) How are media space tools used by students in these classrooms, both in terms of the structure of communications axctivity and the surrounding physical and temporal constrains of the environment?; and (2) What are possible explanations for student behaviors and attitudes with regard to media space tools? The answers to these questions will provide insight for the design of next- generation media spaces for educational settings. ]]></abstract>
  <author idref="3"><![CDATA[Barry J. Fishman School of Education and Social Policy Northwestern University Evanston, IL 60208 ( 708 ) 467-2405 bfishman@covis.nwu.edu]]></author>
  ...
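
One possible way to assemble a structured document of the kind shown in Table 6 from the labels and the zone text is sketched below in Python. The function build_structured_document and the in-memory inputs are hypothetical stand-ins for the label file and the layout file, and the zone 6 text is truncated for brevity. Note that Python's standard ElementTree serializer escapes text rather than emitting CDATA sections, so the output differs from Table 6 in that respect.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/1999/XMLSchema-instance"  # namespace used in Table 6

# Hypothetical inputs: zone id -> recognized text, and element name -> zone ids.
ZONE_TEXT = {
    1: "High-End High School Communication: Strategies and Practices "
       "of Students In a Networked EnvIronment",
    5: "KEYWORDS: Media Spaces, Education, Communication, Design",
    6: "INTRODUCTION Classroom are like islands",  # truncated sample text
}
LABELS = {"title": [1], "leftColumnText": [5, 6]}


def build_structured_document(labels, zone_text, schema="twoColumn.xsd"):
    """Create a <document> with one child element per labeled zone."""
    ET.register_namespace("xsi", XSI)
    root = ET.Element("document", {f"{{{XSI}}}noNamespaceSchemaLocation": schema})
    for label, zone_ids in labels.items():
        for zone_id in zone_ids:
            # The idref attribute ties each instance back to its zone.
            child = ET.SubElement(root, label, idref=str(zone_id))
            child.text = zone_text[zone_id]
    return root


doc = build_structured_document(LABELS, ZONE_TEXT)
print(ET.tostring(doc, encoding="unicode"))
```

Emitting one element per zone, rather than concatenating the zones of a label, preserves the multiple-instance structure described above.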

A portion of a structured document XSL file is included in Table 7. The structured document XSL file describes how each zone in the structured document XML file should be presented (coordinates, font size, etc.). In an exemplary embodiment, this file is automatically generated to reflect the original layout of the document. However, it can be modified to adapt to different display devices. For example, in an XML browser on a PDA, because of the limited display size, the font may be set to a smaller size and/or only the “abstract” element may be displayed.

TABLE 7

<?xml version="1.0" encoding="gb2312"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
<html><body>
  <xsl:for-each select="/document/title">
    <div id="layer1" style="position:absolute; width:1499px; height:261px; z-index:1; left: 552px; top: 215px">
      <Font style="font-size:25pt;color:#000000">
        <xsl:value-of select="text()"/>
      </Font>
    </div>
  </xsl:for-each>
  <xsl:for-each select="/document/leftColumnText">
    <xsl:if test="@idref[.='5']">
      <div id="layer2" style="position:absolute; width:1034px; height:116px; z-index:2; left: 247px; top: 1512px">
        <Font style="font-size:25pt;color:#000000">
          <xsl:value-of select="text()"/>
        </Font>
      </div>
    </xsl:if>
    <xsl:if test="@idref[.='6']">
      <div id="layer2" style="position:absolute; width:1057px; height:708px; z-index:2; left: 236px; top: 1660px">
        <Font style="font-size:25pt;color:#000000">
          <xsl:value-of select="text()"/>
        </Font>
      </div>
    </xsl:if>
    <xsl:if test="@idref[.='7']">
      <div id="layer2" style="position:absolute; width:1045px; height:364px; z-index:2; left: 236px; top: 2404px">
        <Font style="font-size:25pt;color:#000000">
          <xsl:value-of select="text()"/>
        </Font>
      </div>
    </xsl:if>
  </xsl:for-each>
  <xsl:for-each select="/document/abstract">
    <div id="layer3" style="position:absolute; width:1044px; height:611px; z-index:3; left: 244px; top: 871px">
      <Font style="font-size:25pt;color:#000000">
        <xsl:value-of select="text()"/>
      </Font>
    </div>
  </xsl:for-each>
  <xsl:for-each select="/document/author">
    <div id="layer4" style="position:absolute; width:790px; height:364px; z-index:4; left: 910px; top: 478px">
      <Font style="font-size:25pt;color:#000000">
        <xsl:value-of select="text()"/>
      </Font>
    </div>
  </xsl:for-each>
  ...
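
The following Python sketch illustrates how one layer of an XSL file like that of Table 7 might be generated from a zone's coordinates. The source does not state how the layer geometry is derived; the 10-pixel margin used here is an assumption that happens to reproduce the title layer of Table 7 from the title zone of Table 4 (for the author layer the same rule gives a height one pixel larger than the table shows), and the function names are hypothetical.

```python
def div_style(box, layer, margin=10):
    """CSS for an absolutely positioned layer around a zone box.

    box is (left, top, right, bottom); margin is an assumed padding in pixels.
    """
    left, top, right, bottom = box
    return (
        f"position:absolute; width:{right - left + 2 * margin}px; "
        f"height:{bottom - top + 2 * margin}px; z-index:{layer}; "
        f"left: {left - margin}px; top: {top - margin}px"
    )


def for_each_block(element, box, layer, font_pt=25):
    """Return one xsl:for-each block in the style of Table 7."""
    return (
        f'<xsl:for-each select="/document/{element}">\n'
        f'  <div id="layer{layer}" style="{div_style(box, layer)}">\n'
        f'    <Font style="font-size:{font_pt}pt;color:#000000">\n'
        f'      <xsl:value-of select="text()"/>\n'
        f'    </Font>\n'
        f'  </div>\n'
        f'</xsl:for-each>'
    )


# Title zone box taken from Table 4.
print(for_each_block("title", (562, 225, 2041, 466), layer=1))
```

Passing a smaller font_pt, or generating a block only for the abstract element, corresponds to the adaptation for small displays mentioned above.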

Although the invention has been described in terms of a document processor 102, labeler 116, and structured document generator 120, it is contemplated that the invention may be implemented in software on a general purpose computer (not shown). In this embodiment, one or more of the functions of the various components may be implemented in software that controls the general purpose computer. This software may be embodied in a computer readable carrier, for example, a magnetic or optical disk, a memory-card, or an audio frequency, radio-frequency, or optical carrier wave.

Although the invention is illustrated and described herein with reference to specific embodiments, the invention is not intended to be limited to the details shown. Rather, various modifications may be made in the details within the scope and range of equivalents of the claims and without departing from the invention.

1. A method for generating structured document files from a document image, the method comprising the steps of: segmenting the document image into one or more zones, at least one of the one or more zones containing a respective text image; converting the respective text images within the at least one of the one or more zones to digital text; automatically identifying layout information for each of the one or more zones; labeling each of the one or more zones in accordance with a schema; and automatically associating mark-up language tags with the labeled zones to generate the structured document files responsive to the identified layout information and a model file.
2. The method of claim 1, wherein the model file is associated with the schema and wherein the labeling step comprises at least the steps of: automatically labeling each of the one or more zones responsive to the model file.
3. The method of claim 1, further comprising the steps of: receiving editing commands corresponding to the one or more zones; and updating the one or more zones responsive to the editing commands.
4. The method of claim 3, wherein the step of receiving editing commands includes the step of receiving text editing commands and the step of updating the one or more zones includes the step of editing the digital text responsive to the text editing commands.
5. The method of claim 3, wherein the step of receiving editing commands includes the step of receiving segmenting commands and the step of updating the one or more zones includes the step of updating characteristics of the one or more zones responsive to the segmenting commands.
6. The method of claim 1, further comprising the step of: receiving editing commands corresponding to the schema; updating the schema responsive to the editing commands.
7. The method of claim 1, wherein the respective text images are displayed on a graphical user interface (GUI) and wherein the converting step comprises at least the step of: overlaying the respective text images displayed on the GUI with the at least one of the one or more zones with the corresponding digital text.
8. The method of claim 1, wherein the structured document files include an XML file and an XSL file for each document image and wherein the generating steps comprises at least the step of: formatting the XSL file such that information corresponding to each of the labeled zones in the XML file is displayed in multiple layers on a web browser.
9. The method of claim 1, wherein the steps of segmenting, converting, labeling, and automatically associating mark-up language tags are performed sequentially responsive to a selection of a workflow icon of a graphical user interface and wherein the method further comprises the step of: updating the workflow icon to represent a next step of the segmenting, converting, labeling, and automatically associating mark-up language tags to be performed, wherein the workflow icon presents a unique image corresponding to each step.
10. A system for generating structured document files from a document image, the system comprising: means for segmenting the document image into one or more zones, at least one of the one or more zones containing a respective text image; means for converting the respective text images within the at least one of the one or more zones to digital text; means for automatically identifying layout information for each of the one or more zones; means for labeling each of the one or more zones in accordance with a schema; and means for automatically associating mark-up language tags with the labeled zones to generate the structured document files responsive to the identified layout information and a model file.
11. The system of claim 10, further comprising: means for receiving editing commands corresponding to the one or more zones; and means for updating the one or more zones responsive to the editing commands.
12. The system of claim 10, further comprising: means for receiving editing commands corresponding to the schema; means for updating the schema responsive to the editing commands.
13. A structured mark-up language generator for generating structured document files from a document image, the generator comprising: a document processor that: a) segments the document image into one or more zones, at least one of the one or more zones containing a respective text image, b) identifies layout information for each of the one or more zones, and c) converts the respective text images within the at least one of the one or more zones to digital text; a labeler that labels each of the one or more zones in accordance with a schema; and a structured document generator that generates the structured document files responsive to the identified layout information and a model file.
14. The generator of claim 13, further comprising: an editor coupled to the document processor that enables editing of the digital text and the one or more zones.
15. The generator of claim 13, further comprising: an editor coupled to the labeler that enables editing of the labels for each of the one or more zones.
16. A graphical user interface (GUI) for generating structured document files from a document image, the GUI comprising: a document panel for displaying a document image; a schema panel for displaying a schema corresponding to the document image; and a workflow icon for directing the generation of at least one structured mark-up language document from the document image, the workflow icon reflecting a next step in a process to generate the at least one structured mark-up language document.
17. The GUI of claim 16, wherein the process includes sequentially performing the steps of loading an image, segmenting the image into zones, converting text within the zones to digital text, labeling the zones, and generating the at least one structured document and wherein the workflow icon is updated during the process to present unique images corresponding to each step to be performed in the process.
18. A computer readable medium including software that is configured to control a general purpose computer to implement a method for generating structured document files from a document image, the method comprising the steps of: segmenting the document image into one or more zones, at least one of the one or more zones containing a respective text image; converting the respective text images within the at least one of the one or more zones to digital text; automatically identifying layout information for each of the one or more zones; labeling each of the one or more zones in accordance with a schema; and automatically associating mark-up language tags with the labeled zones to generate the structured document files responsive to the identified layout information and a model file.
19. The computer readable medium of claim 18, wherein the method implemented by the software configured general purpose computer further comprises: updating the one or more zones responsive to editing commands corresponding to the one or more zones.
20. The computer readable medium of claim 18, wherein the method implemented by the software configured general purpose computer further comprises: updating the schema responsive to editing commands corresponding to the schema.